Encryption by using base-n systems with many characters
It is possible to interpret text as numbers (and vice versa) if one interprets
letters and other characters as digits and assumes that they have an inherent
immutable ordering. This is demonstrated by the conventional digit set of the
hexadecimal system of number coding, where the letters ABCDEF, in this exact
alphabetic sequence, each stand for a digit and thus a numerical value. In this
article, we develop this idea consistently and include all symbols and the
standard ordering of the Unicode standard for digital character coding. We show
how this can be used to form digit sets of different sizes and how subsequent
simple conversion between bases can result in encryption that mimics the results
of wrong encoding and accidental noise. Unfortunately, because of encoding
peculiarities, switching to a higher base does not necessarily result in
efficient disk space compression. Comment: 12 pages, 6 figures
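The idea in the abstract above can be illustrated with a minimal sketch. It is not the article's implementation: the function names (`make_digits`, `text_to_int`, `int_to_text`, `recode`) and the choice of Unicode ranges are assumptions made for illustration. A digit set is taken to be a run of consecutive code points, a text is read as a number in the base given by the size of that set, and the number is then written out again with a different digit set.

```python
def make_digits(start: int, size: int) -> str:
    """Hypothetical helper: `size` consecutive Unicode code points from `start`."""
    return "".join(chr(start + i) for i in range(size))


def text_to_int(text: str, digits: str) -> int:
    """Read `text` as a number whose digits are the characters in `digits`."""
    base = len(digits)
    value = {ch: i for i, ch in enumerate(digits)}
    n = 0
    for ch in text:
        n = n * base + value[ch]  # raises KeyError for characters outside the set
    return n


def int_to_text(n: int, digits: str) -> str:
    """Write the non-negative integer `n` using the characters in `digits`."""
    base = len(digits)
    if n == 0:
        return digits[0]
    out = []
    while n > 0:
        n, r = divmod(n, base)
        out.append(digits[r])
    return "".join(reversed(out))


def recode(text: str, src_digits: str, dst_digits: str) -> str:
    """Convert `text` from one digit set (base) to another."""
    return int_to_text(text_to_int(text, src_digits), dst_digits)


# Assumed digit sets: printable ASCII (base 95) and 1000 CJK ideographs (base 1000).
ASCII_DIGITS = make_digits(0x20, 95)
CJK_DIGITS = make_digits(0x4E00, 1000)

scrambled = recode("Hello", ASCII_DIGITS, CJK_DIGITS)
restored = recode(scrambled, CJK_DIGITS, ASCII_DIGITS)
```

In this sketch a text that begins with the zero digit of its set (here the space character) loses that leading character on the round trip, just as leading zeros vanish from ordinary numbers. A practical scheme would have to handle this case separately.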
Language classification from bilingual word embedding graphs
We study the role of the second language in bilingual word embeddings in
monolingual semantic evaluation tasks. We find strongly and weakly positive
correlations between downstream task performance and second language
similarity to the target language. Additionally, we show how bilingual word
embeddings can be employed for the task of semantic language classification and
that joint semantic spaces vary in meaningful ways across second languages. Our
results support the hypothesis that semantic language similarity is influenced
by both structural similarity and geography/contact. Comment: To be published at Coling 201
An open problem in computational stemmatology - a model for contamination
In this contribution, two open problems in computational stemmatology are considered. The first one is contamination, an umbrella term referring to all phenomena of admixture of text variants that result from scribes consulting more than one manuscript, or even memory, when copying a text. This problem is one of the biggest to date in stemmatology, since it implies an entirely different formal approach to reconstructing the copy history of a tradition and, in turn, to reconstructing an urtext. Maas (1937) famously stated that there is no remedy against contamination, and Pasquali and Pieraccioni (1952) coined the terms 'open' vs. 'closed' recensions to distinguish contaminated from uncontaminated traditions. We present a graph-theoretical model which formally accommodates traditions with any degree of contamination while maintaining a temporal ordering, and give combinatorial numbers and formulas on the implications for the number of possible scenarios.
Tools, evaluation and preprocessing for stemmatology
This thesis deals with stemmatology, i.e. primarily the reconstruction of the copy history of documents transmitted in handwriting. The central object of stemmatology is the stemma, a visual representation of the copy history, which in graph-theoretical terms is usually a tree or a directed acyclic graph, where the nodes represent textual witnesses (i.e. the text variants) and the edges stand for individual copying processes. At the center of the discipline are the question of the author's original (if a single such original existed) and the question of reconstructing its text. The stemma itself is a means to this main end (Cameron 1987). The original text, increasingly altered by the deviations characteristic of manual copying, has usually not been transmitted directly. The aim of the thesis is to describe semi-automatic stemmatology comprehensively and to advance it through tools and analytical methods. The first part describes the history of computer-assisted stemmatology, including its classical precursors, and culminates in the presentation of a simple tool for the dynamic graphical display of stemmata. An excursus on the guiding philological principle of lectio difficilior discusses its possible psycholinguistic causes in the faster lexical access to high-frequency lexemes. In the second part, the most existential of all stemmatological debates, initiated by Joseph Bédier, is examined with mathematical arguments based on a stemmatic model proposed by Paul Maas in 1937. Furthermore, in this chapter the author simulates stemmata in order to estimate the potential influence of the distribution of copy frequencies per manuscript.
In the next part, the author presents a specially compiled corpus in Persian, which is examined qualitatively, as are 3 of the known artificial corpora (Parzival, Notre Besoin, Heinrichi). Finally, with the Multi Modal Distance, a method for stemma generation is applied that is based on external data on psycholinguistically determined letter confusion probabilities. In the last part, the author works with minimum spanning trees for stemma generation, carrying out, evaluating and discussing a comparative study that combines 4 methods of distance matrix generation with 4 methods of stemma generation.
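As a rough illustration of the last step mentioned in this abstract, the sketch below derives an undirected tree from a pairwise distance matrix between witnesses using Prim's algorithm. It is not the thesis's implementation: the function name `minimum_spanning_tree`, the example matrix, and the witness indices are assumptions, and the thesis's actual distance measures and stemma generation methods are not specified here.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a symmetric distance matrix (list of lists).

    Returns a list of edges (i, j), where i is already in the tree and j is
    the witness added by that edge. Assumes at least one witness.
    """
    n = len(dist)
    in_tree = {0}  # start from witness 0; the choice of start is arbitrary
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge leading from the tree to a witness outside it.
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda edge: dist[edge[0]][edge[1]],
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


# Hypothetical distances between four witnesses (e.g. counts of differing readings).
example_dist = [
    [0, 1, 4, 5],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [5, 6, 3, 0],
]
example_edges = minimum_spanning_tree(example_dist)
```

A minimum spanning tree is undirected and contains only the given witnesses, so turning it into a stemma would still require choosing a root and, where appropriate, positing lost intermediate copies.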
A Manual for Web Corpus Crawling of Low Resource Languages
Since the seminal publication of "Web as Corpus" [1], the potential of creating corpora from the web has been realized for both online and offline corpora: noisy vs. clean, balanced vs. convenient, annotated vs. raw, small vs. big are only some of the oppositions that can be used to describe the range of corpora that can be and have been created. In our case, in the wake of the project Under Resourced Language Content Finder (URLCoFi), we describe a systematic approach to the compilation of corpora for low (or under) resource(d) languages (LRL) from the web in connection with a free eLearning course funded by studiumdigitale at Goethe University, Frankfurt. Despite the ease of retrieving documents from the web, some characteristics of the digital medium introduce certain difficulties. For instance, if someone were to collect all documents on the web in a certain language, firstly, the collection could only be a snapshot, since the web constantly changes content, and secondly, there would be no way to ascertain completeness. In this paper, we show ways to deal with such difficulties in search scenarios for LRLs, presenting experiences from a course on this topic. [1] A. Kilgarriff and G. Grefenstette, "Web as corpus," in Proceedings of Corpus Linguistics 2001, 2001, pp. 342-344.
A practitioner's view: a survey and comparison of lemmatization and morphological tagging in German and Latin
The challenge of POS tagging and lemmatization in morphologically rich languages is examined by comparing German and Latin. We start by defining an NLP evaluation roadmap to model the combination of tools and resources guiding our experiments. We focus on what a practitioner can expect when using state-of-the-art solutions. These solutions are then compared with old(er) methods and implementations for coarse-grained POS tagging, as well as fine-grained (morphological) POS tagging (e.g. case, number, mood). We examine to what degree recent advances in tagger development have improved accuracy, and at what cost in terms of training and processing time. We also conduct in-domain vs. out-of-domain evaluation. Out-of-domain evaluation is particularly pertinent because the distribution of data to be tagged will typically differ from the distribution of data used to train the tagger. Pipeline tagging is then compared with a tagging approach that acknowledges dependencies between inflectional categories. Finally, we evaluate three lemmatization techniques.
Handbook of Stemmatology: History, Methodology, Digital Approaches
Stemmatology studies aspects of textual criticism that use genealogical methods to analyse a set of copies of a text whose autograph is lost. As an art (ars), stemmatology has its main goal in editing, and thus presenting to the reader, such a text in the most satisfactory way; as a more abstract discipline (scientia), it is interested in the general principles of how texts change in the process of being copied. This handbook provides the first coverage of the entire field: theoretical and practical aspects of traditional and modern digital methods. Thirty-eight experts from all fields involved joined forces to write the book, which covers in forty-one sections topics ranging from material aspects of text traditions, through methods of traditional textual criticism, to modern digital methods used in the field. The two final chapters, respectively, provide closer views of how the approach towards texts and textual criticism has developed in some well-defined disciplines of textual scholarship, and compare methods used in other fields dealing with "descent with modification". Illustrations with many practical examples from a wide range of disciplines are provided to render the content more accessible. The intended readership comprises both students of the various fields involved with texts and more advanced scholars.